Search CORE

10 research outputs found

Towards the use of mini-applications in performance prediction and optimisation of production codes

Author: Owenson A. M. B
Publication venue
Publication date
Field of study

Maintaining the performance of large scientific codes is a difficult task. To aid in this task a number of mini-applications have been developed that are more tract able to analyse than large-scale production codes, while retaining the performance characteristics of them. These “mini-apps” also enable faster hardware evaluation, and for sensitive commercial codes allow evaluation of code and system changes outside of access approval processes. Techniques for validating the representativeness of a mini-application to a target code are ultimately qualitative, requiring the researcher to decide whether the similarity is strong enough for the mini-application to be trusted to provide accurate predictions of the target performance. Little consideration is given to the sensitivity of those predictions to the few differences between the mini-application and its target, how those potentially-minor static differences may lead to each code responding very differently to a change in the computing environment. An existing mini-application, ‘Mini-HYDRA’, of a production CFD simulation code is reviewed. Arithmetic differences lead to divergence in intra-node performance scaling, so the developers had removed some arithmetic from Mini-HYDRA, but this breaks the simulation so limits numerical research. This work restores the arithmetic, repeating validation for similar performance scaling, achieving similar intra-node scaling performance whilst neither are memory-bound. MPI strong scaling functionality is also added, achieving very similar multi-node scaling performance. The arithmetic restoration inevitably leads to different memory-bounds, and also different and varied responses to changes in processor architecture or instruction set. A performance model is developed that predicts this difference in response, in terms of the arithmetic differences. It is supplemented by a new benchmark that measures the memory-bound of CFD loops. Together, they predict the strong scaling performance of a production ‘target’ code, with a mean error of 8.8% (s = 5.2%). Finally, the model is used to investigate limited speedup from vectorisation despite not being memory-bound. It identifies that instruction throughput is significantly reduced relative to serial counterparts, independent of data ordering in memory, indicating a bottleneck within the processor core

Warwick Research Archives Portal Repository

Under the hood of SYCL - an initial performance analysis with an unstructured-mesh CFD application

Author: Jarvis Stephen A.
Mudalige Gihan R.
Owenson A. M. B
Powell Archie
Reguly Istvan Z.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/06/2021
Field of study

As the computing hardware landscape gets more diverse, and the complexity of hardware grows, the need for a general purpose parallel programming model capable of developing (performance) portable codes have become highly attractive. Intel’s OneAPI suite, which is based on the SYCL standard aims to fill this gap using a modern C++ API. In this paper, we use SYCL to parallelize MGCFD, an unstructured-mesh computational fluid dynamics (CFD) code, to explore current performance of SYCL. The code is benchmarked on several modern processor systems from Intel (including CPUs and the latest Xe LP GPU), AMD, ARM and Nvidia, making use of a variety of current SYCL compilers, with a particular focus on OneAPI and how it maps to Intel’s CPU and GPU architectures. We compare performance with other parallelisations available in OP2, including SIMD, OpenMP, MPI and CUDA. The results are mixed; the performance of this class of applications, when parallelized with SYCL, highly depends on the target architecture and the compiler, but in many cases comes close to the performance of currently prevalent parallel programming models. However, it still requires different parallelization strategies or code-paths be written for different hardware to obtain the best performanc

University of Birmingham Research Portal

Warwick Research Archives Portal Repository